In late 2019, a novel coronavirus, severe acute respiratory synfrome
coronavirus 2 (SARS-CoV-2), which cause a severe respiratory disease,
emerged in Wuhan, China. The virus spread all over the world, and
infected more than 750 million people (WHO
2020). It is reported that patients in different population shows
different trend of symptoms. For example, females are less likely to
infect SARS-CoV-2 (Steeg 2016). However,
mechanisms behind such a difference is still unknown. Lieberman et.
al. examined gene expression in response to SARS-CoV-2 infection with
shotgun RNA Sequencing to see the gene expression changes(Lieberman 2020). Test condition is 430
SARS-CoV-2 patients and control condition is 54 non-infected
individuals.
In the first assignment, RNASeq data of the paper by Lieberman(Lieberman 2020) were obtained from GEO, and its
ID is GSE152075. There are 35784 genes in this dataset, and 430 samples
from SARS-CoV-2 patients i.e. testing samples and 54 samples from
controls. The RNASeq data was mapped to HUGO gene symbols, cleaned
e.g. duplicates and low counts were removed, normalized with TMM
normalization. MDS plot indicates a good quality of the dataset as
positive and negative samples are well-separated. In this assignment,
the normalized data is analyzed with differential gene expression
analysis, and genes are ranked. Then thresholded over-representation
analysis reveals the notably expressed gene.
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("scales", quietly = TRUE))
BiocManager::install("scales")
if (!requireNamespace("circlize", quietly = TRUE))
BiocManager::install("circlize")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("grid", quietly = TRUE))
BiocManager::install("grid")
The normalized count data and groups from the first assignment are loaded.
# The normalized count data
normalized_count <- read.table("normalized_counts.txt")
# The beginning of the dataset.
knitr::kable(normalized_count[1:15,1:5], format = "html")
| POS_001 | POS_002 | POS_003 | POS_004 | POS_005 | |
|---|---|---|---|---|---|
| A1BG | 0.000000 | 1.53141 | 0.000000 | 0.000000 | 19.416692 |
| A1CF | 0.000000 | 0.00000 | 3.759022 | 0.000000 | 0.000000 |
| A2M | 265.016689 | 55.13076 | 157.878925 | 153.605438 | 89.532527 |
| A2ML1 | 7.681643 | 0.00000 | 0.000000 | 0.000000 | 3.236115 |
| A4GALT | 0.000000 | 16.84551 | 3.759022 | 0.000000 | 2.157410 |
| AAAS | 15.363286 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
| AACS | 7.681643 | 50.53653 | 3.759022 | 0.000000 | 29.125039 |
| AADACL2 | 0.000000 | 0.00000 | 7.518044 | 14.629089 | 0.000000 |
| AADACP1 | 7.681643 | 0.00000 | 0.000000 | 3.657272 | 0.000000 |
| AADAT | 0.000000 | 0.00000 | 0.000000 | 10.971817 | 9.708346 |
| AAGAB | 11.522465 | 41.34807 | 71.421419 | 0.000000 | 38.833385 |
| AAK1 | 84.498075 | 82.69614 | 186.071590 | 47.544540 | 161.805771 |
| AAMDC | 3.840822 | 6.12564 | 5.638533 | 0.000000 | 6.472231 |
| AAMP | 0.000000 | 6.12564 | 26.313154 | 0.000000 | 7.550936 |
| AAR2 | 0.000000 | 0.00000 | 11.277066 | 3.657272 | 17.259282 |
# The end of the test samples and the beginning of the controls.
knitr::kable(normalized_count[1:15, 428:432], format = "html")
| POS_436 | POS_437 | POS_438 | NEG_001 | NEG_002 | |
|---|---|---|---|---|---|
| A1BG | 0.000000 | 12.915179 | 72.8645754 | 10.604908 | 13.0599705 |
| A1CF | 0.000000 | 0.000000 | 18.2993226 | 7.574934 | 0.7255539 |
| A2M | 351.104235 | 40.037055 | 7.6524440 | 27.774758 | 87.7920241 |
| A2ML1 | 0.000000 | 10.332143 | 12.9758833 | 2.524978 | 5.8044313 |
| A4GALT | 0.000000 | 40.682814 | 2.9944346 | 4.544961 | 7.2555392 |
| AAAS | 0.000000 | 14.206697 | 3.6598645 | 17.674846 | 30.4732646 |
| AACS | 11.703475 | 43.265850 | 39.5930798 | 22.724802 | 7.2555392 |
| AADACL2 | 0.000000 | 18.081251 | 18.2993226 | 105.039086 | 26.1199410 |
| AADACP1 | 2.925869 | 10.977902 | 8.9833038 | 12.119894 | 0.7255539 |
| AADAT | 0.000000 | 2.583036 | 4.9907243 | 10.099912 | 0.7255539 |
| AAGAB | 32.184555 | 48.431921 | 17.3011777 | 31.309728 | 21.7666175 |
| AAK1 | 190.181460 | 173.709158 | 164.3611883 | 176.243466 | 264.1016261 |
| AAMDC | 0.000000 | 12.915179 | 14.3067431 | 1.009991 | 1.4511078 |
| AAMP | 0.000000 | 7.749107 | 0.6654299 | 5.554952 | 5.0788774 |
| AAR2 | 0.000000 | 13.560938 | 7.3197290 | 10.604908 | 9.4322009 |
# The end of the controls.
knitr::kable(normalized_count[1:15, 480:484], format = "html")
| NEG_059 | NEG_060 | NEG_062 | NEG_063 | NEG_065 | |
|---|---|---|---|---|---|
| A1BG | 1.187264 | 0.000000 | 0.000000 | 0.00000 | 2.256175 |
| A1CF | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 4.512351 |
| A2M | 53.426879 | 0.000000 | 11.482416 | 0.00000 | 121.833471 |
| A2ML1 | 3.561792 | 2.319754 | 0.000000 | 56.28336 | 4.512351 |
| A4GALT | 23.745280 | 4.639508 | 3.444725 | 7.18511 | 0.000000 |
| AAAS | 16.621696 | 15.078401 | 11.482416 | 0.00000 | 2.256175 |
| AACS | 22.558016 | 20.877786 | 25.261316 | 19.16029 | 29.330280 |
| AADACL2 | 5.936320 | 1.159877 | 4.592967 | 0.00000 | 135.370523 |
| AADACP1 | 5.936320 | 4.639508 | 4.592967 | 0.00000 | 18.049403 |
| AADAT | 9.498112 | 9.279016 | 9.185933 | 81.43125 | 4.512351 |
| AAGAB | 40.366976 | 28.996925 | 33.299007 | 0.00000 | 4.512351 |
| AAK1 | 54.614143 | 31.316680 | 41.336698 | 79.03621 | 185.006382 |
| AAMDC | 9.498112 | 2.319754 | 2.296483 | 0.00000 | 11.280877 |
| AAMP | 47.490560 | 42.915450 | 41.336698 | 14.37022 | 0.000000 |
| AAR2 | 11.872640 | 18.558032 | 18.371866 | 0.00000 | 6.768526 |
# The groups defined in the first assignment
samples <- read.table("samples.txt")
# The beginning of the dataset.
knitr::kable(samples[1:5, ], format = "html")
| individual | SARS.CoV.2 | |
|---|---|---|
| POS_001 | 1 | POS |
| POS_002 | 2 | POS |
| POS_003 | 3 | POS |
| POS_004 | 4 | POS |
| POS_005 | 5 | POS |
# The end of the test samples and the beginning of the controls.
knitr::kable(samples[428:432, ], format = "html")
| individual | SARS.CoV.2 | |
|---|---|---|
| POS_436 | 436 | POS |
| POS_437 | 437 | POS |
| POS_438 | 438 | POS |
| NEG_001 | 1 | NEG |
| NEG_002 | 2 | NEG |
# The end of the controls.
knitr::kable(samples[480:484, ], format = "html")
| individual | SARS.CoV.2 | |
|---|---|---|
| NEG_059 | 59 | NEG |
| NEG_060 | 60 | NEG |
| NEG_062 | 62 | NEG |
| NEG_063 | 63 | NEG |
| NEG_065 | 65 | NEG |
Note that some samples are omitted e.g.NEG_061, and further details about patients are not submitted by the author for a privacy reason.
# Create DGEList object.
filtered_data_matrix <- as.matrix(normalized_count)
d <- edgeR::DGEList(counts=filtered_data_matrix, group=samples$cell_type)
# Create a design model.
model_design_pat <- model.matrix(~samples$SARS.CoV.2 + samples$individual)
# Estimate dispersion.
d <- edgeR::estimateDisp(d, model_design_pat)
# Calculate normalization factors.
d <- edgeR::calcNormFactors(d)
# MDS plot.
limma::plotMDS(d,
labels=rownames(samples),
col = c("darkgreen","blue")[factor(samples$SARS.CoV.2)],
main = "MDS Plot")
Figure 1: A MDS plot of samples. Blue: SARS-CoV-2 positive patients, green: SARS-CoV-2 negative patients.
There is a clear separation between positive(blue) and negative(green) samples. This indicates that there is not big patient factor. Quasi likelihood test is applied here, because there are 484 samples, which is too much for likelihood ratio test.
# Create DGEList object.
filtered_data_matrix <- as.matrix(normalized_count)
d <- edgeR::DGEList(counts=filtered_data_matrix, group=samples$cell_type)
# Estimate Dispersion.
d <- edgeR::estimateDisp(d, model_design_pat)
# Fit the model.
fit <- edgeR::glmQLFit(d, model_design_pat)
# Obtain differential expression with Quasi likelihood test.
qlf.pos_vs_neg <- edgeR::glmQLFTest(fit, coef="samples$SARS.CoV.2POS")
# Get the result.
qlf_output <- edgeR::topTags(qlf.pos_vs_neg, sort.by = "PValue", n = nrow(normalized_count))
knitr::kable(head(qlf_output), format = "html", digits = 64)
|
|
|
|
# The number of genes which pass the threshold (0.01).
length(which(qlf_output$table$PValue < 0.01))
## [1] 4272
There are 4272 genes showed significant differential expression. The threshold is 0.01 here, because the sample size is very large and the p-values tend to be higher.
For multiple hypothesis testing, Bonferroni method is not proper in this case, because there is a huge number of genes in this dataset and Bonferroni method tends to be too strict with large hypothesis. Therefore, FDR is used here. Threshold is set 0.01 for the same reason as p-value above.
# The number of genes which pass the threshold (0.01) with FDR.
length(which(qlf_output$table$FDR < 0.01))
## [1] 3132
3132 genes passes the correction with FDR. Correction with Bonferroni method is shown below.
# The number of genes which pass the threshold (0.01) with Bonferroni method.
length(which(qlf_output$table$PValue < 0.01 / length(qlf_output$table$PValue)))
## [1] 862
There are 862 genes passed the correction with Bonferroni method and this is much smaller than the one with FDR. However, there are still too many genes with 0.01 threshold and this causes some errors. Therefore, 0.001 is used as threshold instead in later cases.
# The number of genes which pass the threshold (0.001).
length(which(qlf_output$table$PValue < 0.001))
## [1] 2698
# The number of genes which pass the threshold (0.001) with FDR.
length(which(qlf_output$table$FDR < 0.001))
## [1] 1837
There are 2698 genes with p-value which pass 0.001 threshold, and 1837 genes which pass the correction.
# Apply grey for all at first.
color_volcano <- rep("grey", length(qlf_output$table$PValue))
# Red for up-regulated ones.
# Note 0.001, not 0.01 threshold is used here for the reason mentioned above.
color_volcano[qlf_output$table$logFC > 2 & qlf_output$table$FDR < 0.001] <- 'red'
# Blue for down-regulated ones.
# Note 0.001, not 0.01 threshold is used here for the reason mentioned above.
color_volcano[qlf_output$table$logFC < -2 & qlf_output$table$FDR < 0.001] <- 'blue'
# Volcano plot.
plot(qlf_output$table$logFC,
-log10(qlf_output$table$PValue),
col = scales::alpha(color_volcano, 0.3),
pch = 16,
xlab = "log2 fold change",
ylab = "-log10 p",
main = "Volcano Plot for SARS-CoV-2 Positive/Negative Patients"
)
legend("topright",
legend=c("Upregulated Genes","Downregulated Genes", "Not Significant Genes"),
fill = c("red", "blue", "grey"))
Figure 2: A volcano plot for SARS-CoV-2 positive/negative patients. Genes whose log2 fold change is more than 2 and FDR is less than the threshold is defined as up-regulated genes and colored in red. Down-regulated genes are those which has less than -2 logs fold change and less than the threshold FDR, and are shown in blue. Note that the threshold is set 0.001, not 0.01.
# Create a numerical matrix of genes with FDR less than the threshold.
# Note 0.001, not 0.01 threshold is used here for the reason mentioned above.
# Here is the error given when 0.01 threshold is used. This is probably because of too many genes and patients.
# Error in getFromNamespace(device_info[3], ns = device_info[2])(temp_image) :
# unable to open /var/folders/cc/myvsc01x5gs6_76g6xsmrb6m0000gn/T//RtmpXOogRM/.heatmap_body_e4228f848277c0ae2290f1a79e9515a1_1_144e32682e9a3.png
tophits <- normalized_count[rownames(normalized_count) %in%
rownames(qlf_output$table[which(qlf_output$table$FDR < 0.001),]),]
heatmap_matrix <- t(scale(t(tophits)))
# Set a color table.
if(min(heatmap_matrix) == 0){
heatmap_col <- circlize::colorRamp2(c(0, max(heatmap_matrix)), c( "white", "red"))
}else{
heatmap_col <- circlize::colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)), c("blue", "white", "red"))
}
# Build a heatmap.
heatmap_tophits <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col = heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_names_gp = grid::gpar(fontsize = 2)
)
ComplexHeatmap::draw(heatmap_tophits, column_title = "Heatmap for SARS-CoV-2 Positive/Negative Patients")
Figure 3. Heatmap for genes with significant differential expression. SARS-CoV-2 negative patients are clustered on the left, although the font size of column names are so small that they can be hardly read. Note that the threshold is set 0.001, not 0.01.
In the bottom, there is a cluster of up-regulated genes in SARS-CoV-2 negative patients and such genes are downregulated in SARS-CoV-2 positive patients. Genes at the top tend to be more down-regulated in SARS-CoV-2 negative patients, but genes in SARS-CoV-2 positive patients are not up-regulated. This is probably because most patients are SARS-CoV-2 positive (430 out of 484) and genes that are up-regulated specifically in positive patients cannot give small enough FDR.
Thresholded over-representation analysis is performed with g:Profiler. This is because g:Profiler was used in the lecture and is very handy for this assignment, especially with the web interface.
GO biological process, Reactome, and WikiPathways are used here. This is because these datasets are for human. They were released in 2024-01-17, 2024-01-25, 2024-01-01 respectively.
Up-/down-regulated genes are defined with the threshold of 0.001.
# Extract up-regulated genes.
# Note that the threshold is set 0.001, not 0.01.
up_genes <- rownames(qlf_output)[qlf_output$table$logFC > 0 & qlf_output$table$PValue < 0.001]
length(up_genes)
## [1] 929
# Write into text files.
write.table(up_genes,
"upregulated_genes.txt",
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
# Extract up-regulated genes.
# Note that the threshold is set 0.001, not 0.01.
down_genes <- rownames(qlf_output)[qlf_output$table$logFC < 0 & qlf_output$table$PValue < 0.001]
length(down_genes)
## [1] 1769
# Write into text files.
write.table(down_genes,
"downregulated_genes.txt",
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
# Extract genes for whole list.
# Note that the threshold is set 0.001, not 0.01.
whole_genes <- rownames(qlf_output)[qlf_output$table$PValue < 0.001]
# Write into text files.
write.table(whole_genes,
"whole_genes.txt",
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
There are 929 up-regulated genes and rlength(down_genes)
down-regulated genes.
The thresholded over-representation analysis is performed with the
web interface (https://biit.cs.ut.ee/gprofiler/gost). BH FDR is used
for significance threshold and the threshold is 0.001.
Figure 4: g:Profiler result of up-regulated genes from GO biological process. Top 10 results are shown here.
Figure 5: g:Profiler result of up-regulated genes from Reactome. Top 10 results are shown here. Pathways shown in grey indicates that they are above the threshold.
Figure 6: g:Profiler result of up-regulated genes from WikiPathways. Top 10 results are shown here.
Almost all the pathways for up-regulated genes are related to immune system. Note that among such immune systems, pathways involved in a response to viruses are dominant. This looks reasonable as a up-regulated genes for SARS-CoV-2 positive patients.
Figure 7: g:Profiler result of down-regulated genes from GO biological process. Top 10 results are shown here.
Figure 8: g:Profiler result of down-regulated genes from Reactome. Top 10 results are shown here.
Figure 9: g:Profiler result of down-regulated genes from WikiPathways. Top 10 results are shown here. Pathways shown in grey indicates that they are above the threshold.
For down-regulated genes, there are many pathways for translation. I
found it interesting that there are pathways for respiration system,
although the reason is unclear.
Figure 10: g:Profiler result of whole list from GO biological process. Top 10 results are shown here.
Figure 11: g:Profiler result of whole list from Reactome. Top 10 results are shown here.
Figure 12: g:Profiler result of whole list from WikiPathways. Top 10 results are shown here. Pathways shown in grey indicates that they are above the threshold.
For the whole list, these are apparently a combination of up-/down-regulated reason and not difference found.
The result mostly supports the original paper. Lieberman et.al. reported that up-regulation of anti-viral factors and down-regulation of genes related to ribosomes. However, they also state that reduction in transcription of some interferon-related genes. There is no pathways about interferon in g:Profiler result for down-regulated genes, but there is in the result for up-regulated genes. This might be because of different annotation sources form the original paper. In addition, g:Profiler result does not directly contradict to the original paper because increased expression of interferon response genes are also reported in the original paper.(Lieberman 2020)
Chen et. al. report that transcription of ACE2, which is a interferon-responsive genes, significantly increase in Asian females, moderately decrease in elderly people in all ethnic groups, and highly significant decrease in type II diabetic patients(Chen et al. 2020). Therefore, patients variation like ethnicity, sex or health condition may have a huge impact on the interferon related gene transcription but the impact is not evident in this report because details about patients are not released in Lieberman et.al. paper for privacy reason. (Lieberman 2020).